Reanalysis shows there is not an extreme decline effect in fish ocean acidification studies

This Formal Comment reanalyses the data after appropriate corrections and argues that the extreme decline effect reported by Clements et al. is a statistical artefact of the way they corrected for zeros in percentage data, exacerbated by errors in data compilation, selective data inclusions, and missing studies with strong effects.

treatments, see below) and with missing data included (Table A in S1 Text), Fig 2 shows how the decline effect is driven by the choice of replacement values used in percentage and proportional data. When zero values are replaced with 0.0001, the complete, corrected, and screened data set exhibits a decline in effect size that is not dissimilar to that originally reported by Clements and colleagues (Fig 2A and 2B), except that the initial decline is less steep (Fig 2C), and the variance-weighted average effect sizes are noticeably higher in 2018 to 2019 compared with the original (Fig 2D). However, the decline effect is markedly flatter (Fig 2E), and the magnitudes of the weighted average effect sizes in 2009, 2010, and 2014 are substantially smaller (Fig 2F), when 0.1 is used to correct for zero values in percentage data and 0.001 for proportional data. The decline effect is even flatter (Fig 2G), and weighted effect sizes in 2009, 2010, and 2014 are smaller again (Fig 2H), when zero values in percentage data are replaced with the smallest whole number (1) and 0.01 for proportional data. From this comparison, it is clear that Clements and colleagues' claim of an extreme decline effect is a statistical illusion driven by their method of correcting for zero values in percentage data. Indeed, Lajeunesse [2] warns that "log-ratio effect sizes estimated with RR are at the greatest risk of bias when: (1) the means have small sample sizes, (2) the two means are not close to one another, and (3) at least one of the control and treatment means is near zero", all of which apply here.

Fig 1. (a, b) Original analysis by Clements and colleagues using 0.0001 to replace zero values in percentage and proportional data and (c, d) reanalysis with the corrected, updated, and screened data set using 0.1 to replace zero values in percentage data and 0.001 to replace zero values in proportional data. Top row (a, c) shows all calculated effect sizes (lnRR) fitted with a Loess curve and 95% confidence bounds. Bottom row (b, d) shows the modelled variance-weighted average effect sizes by year. Experiments with smaller variance are given greater weight in calculating the model means in the bottom row. Panel (e) shows how effect sizes are exaggerated when values below 1 are used to replace zeros in percentage data. The calculated effect size magnitude (lnRR = ln(treatment mean/control mean)) is shown for treatment means of 100% and 50% and control mean values between 5% and 0%, illustrating the inflation of lnRR caused by values below 1% in the denominator.
The data underlying this figure (a-d) may be found in https://doi.org/10.25903/jw8m-9007. The data underlying panel (e) are found in S1 Data.
https://doi.org/10.1371/journal.pbio.3001809.g001
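The sensitivity of lnRR to the zero-replacement value can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: it applies the formula lnRR = ln(treatment mean/control mean) to treatment means of 100% and 50%, with a control mean of 0% replaced by each of the three percentage values compared in the text (0.0001, 0.1, and 1). The function and variable names are illustrative assumptions.

```python
import math


def ln_rr(treatment_mean, control_mean):
    """Log response ratio: ln(treatment mean / control mean)."""
    return math.log(treatment_mean / control_mean)


# Percentage values used to replace a control mean of 0%,
# taken from the three scenarios compared in the text.
replacements = [0.0001, 0.1, 1.0]

# Treatment means (in %) used for illustration, as in the figure panel.
treatments = [100.0, 50.0]

# lnRR for every combination of treatment mean and replacement value.
table = {(t, r): ln_rr(t, r) for t in treatments for r in replacements}

for (t, r), value in table.items():
    print(f"treatment = {t:5.1f}%  zero replaced by {r:6.4f}%  lnRR = {value:5.2f}")
```

For a treatment mean of 100%, replacing a zero control with 0.0001 gives lnRR of about 13.8, whereas replacing it with 0.1 gives about 6.9 and with 1 gives about 4.6. The smallest replacement value therefore triples the apparent effect size relative to a replacement of 1, without any change in the underlying data.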
Clements and colleagues' [1] analysis also contains data handling errors, improper data inclusions and exclusions, and missing studies (Table A in S1 Text), all of which exaggerate the decline effect.
A preliminary check of Clements and colleagues' data reveals data entry errors and incorrect values that cause effect sizes to be lower than the true value for studies after 2010. The feeding strikes data for McMahon and colleagues [3] do not match the figure or the underlying raw data, and there are errors in the reported N values, despite the correct data being publicly available online since publication. There are also mistakes in the coding of cue type and life stage of some studies (Table A in S1 Text), as well as incorrect values in the year of publication online and print columns for numerous files (see Methods in S1 Text). These mistakes illustrate how easy it is to make unintentional data handling errors in large, complex data sets, even for authors who have been highly critical of others for doing just that.
Another problem that artificially diminishes effect sizes in papers after 2010 is the inclusion of sham treatments in the calculation of OA treatment effect sizes. Sham treatments, such as the injection of blank seawater with no additional stimulus, are often used in studies that measure the change in behaviour after a stimulus (e.g., predator or alarm cue) is presented, compared to a prestimulus period. Sham treatments are predicted to have no or very small effects if an experiment is working properly. By including these methodological controls as experiments in their analyses, Clements and colleagues have artificially diluted the effect size for several studies conducted after 2010 (Table A in S1 Text). Yet, they removed procedural controls (tests with seawater on both sides of the flume) from studies in 2009 and 2010, thereby increasing the average effect size for those years.
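The dilution caused by sham treatments can be illustrated with a simple inverse-variance weighted mean. This is a sketch under stated assumptions only: the effect sizes and variances below are hypothetical numbers chosen for illustration, not values from any study, and a plain inverse-variance weighted mean stands in for the modelled variance-weighted averages used in the reanalysis.

```python
def weighted_mean(effects, variances):
    """Inverse-variance weighted mean: smaller variance means greater weight."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


# Hypothetical OA treatment effect sizes (lnRR magnitudes) and their variances.
oa_effects = [1.2, 0.9]
oa_variances = [0.2, 0.3]

# Hypothetical sham treatment: near-zero effect, as expected when an
# experiment is working properly, measured with small variance.
sham_effects = [0.05]
sham_variances = [0.1]

without_sham = weighted_mean(oa_effects, oa_variances)
with_sham = weighted_mean(oa_effects + sham_effects, oa_variances + sham_variances)

print(f"mean effect size without sham: {without_sham:.3f}")
print(f"mean effect size with sham:    {with_sham:.3f}")
```

With these illustrative numbers, adding a single near-zero sham result roughly halves the weighted mean, because a precisely measured null contributes a large weight. The same logic applies in reverse to the removal of procedural controls from the 2009 and 2010 studies.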
Clements and colleagues also chose to exclude results where there was a different direction of responses between the control and the OA treatment (Table A in S1 Text). The problem here is that these are often the stronger results directly attributable to OA effects, precisely because the treatment effect goes in the opposite direction to the control. For example, the 3 species for which strong OA effects are observed at 850 ppm CO2 are excluded in the data set for Ferrari and colleagues [4], leaving only the 1 species that was found to be much more tolerant of elevated CO2 in the analysis. By excluding some of the strongest effects, while retaining weaker effects from the same experiments, Clements and colleagues have exacerbated the decline in effect size of experiments immediately after 2010.
A further issue is the inclusion of treatments that diminish the magnitude of OA effects, such as fluctuating CO2 treatments, which were not included in the original studies (Table A in S1 Text). For example, Jarrold and colleagues [5] showed that daily CO2 cycles greatly diminish the behavioural effects of OA compared with stable elevated CO2 treatments used in earlier studies. By including these treatments in their analysis, Clements and colleagues diminish the average effect size that would otherwise be attained.
Finally, some experiments and recent studies [6,7] with strong effects are missing from Clements and colleagues' data set (Table A in S1 Text). Two studies [8,9] for 2019 are also missing, despite the figures portraying a full year. The absence of these studies causes the mean effect size estimated by Clements and colleagues for 2018 to 2019 to be lower than it should be (original versus reanalysis with 0.0001 as the replacement value: 0.443 versus 1.164 for 2018, and 0.088 versus 0.356 for 2019). Moreover, the mean effect size in 2019 does not fall to zero when the data set is complete (Fig 1C and 1D).
Without doubt, there has been a decline through time in the averaged effect size from experiments investigating the behavioural effects of OA on fish, but it is not the extreme decline erroneously reported by Clements and colleagues. A decline in effect size is not surprising as more and different species are tested, some of which will be much less sensitive to the effects of OA than the orange clownfish, which was the first species tested in this field of study (e.g., [10]). Furthermore, an increasing range of different behaviours has been tested through time, many of which are less affected by OA and generate smaller effect sizes than the initial effects of OA on the response of larvae to highly concentrated predator odour and habitat cues. Methods have also changed through time, in ways that reduce effect sizes compared with the earliest studies in the field [11]. Decline effects occur in many areas of science, including ecology [12], yet their analysis and interpretation are still a work in progress [13,14].

S1 Text. Table A. Data errors identified in a non-exhaustive preliminary check of Clements and colleagues' S2 data file, along with incorrect inclusion of sham treatments, missing data, and exclusions that were corrected to enable analysis. Highlight refers to the colour used to show the relevant lines of data in the screened, corrected, and complete data file used in the reanalysis. Data files available at https://doi.org/10.25903/jw8m-9007. (DOCX)

S1 Data. Data associated with Fig 1E. Calculated lnRR for control values ranging from 0.0001% to 5% for treatment values of 50% and 100%. (DOCX)